Abstract

We  describe  a  simple  approach  for  integrating 
shallow  and  deep  parsing.  We  use  phrase  struc¬ 
ture  bracketing  obtained  from  the  Collins  parser 
as  filters  to  guide  deep  parsing.  Our  exper¬ 
iments  demonstrate  that  our  technique  yields 
substantial  gains  in  speed  along  with  modest 
improvements  in  accuracy. 

1  Introduction 

The detailed linguistic analyses generated by deep parsing are an
essential component of spoken dialog systems that collaboratively
perform tasks with users (e.g., Allen et al., 2001). For example,
interpretation in the TRIPS collaborative dialog assistant relies on
the representation produced by its parser for word sense
disambiguation, constituent dependencies, and semantic roles such as
agent, theme, and goal. Broad coverage unification-based deep parsers,
however, unavoidably have problems meeting the very high accuracy and
efficiency requirements of real-time dialog. On the other hand,
parsers based on lexicalized probabilistic context free grammars, such
as those of Collins (1999) and Charniak (1997), which we call shallow
parsers1, are robust and efficient, but the structural representations
obtained with such parsers are insufficient as input for intelligent
reasoning. In addition, they are not accurate when exact match is
considered, as opposed to constituent recall and precision and bracket
crossing. For example, the standard Collins parser yields an exact
match on only 36% of the standard test set (section 23) of the Wall
Street Journal Corpus.

In this paper we explore the question of whether preprocessing with a
shallow parser can produce analyses that are good enough to help
improve the speed and accuracy of deep parsing. Previous work on
German (Frank et al., 2002) pursued a similar strategy and showed
promising results after considerable effort transforming the output of
the shallow parser into useful guidance to the deep parser. We were
interested in seeing if we could take a shallow parser off the shelf,
namely the Collins parser, and use its output fairly directly to
improve the performance of the TRIPS parser. It has been reported that
stochastic parsers degrade in performance on domains different from
those they were trained on (Hwa, 1999; Gildea, 2001), so it was an
open question whether the output would be good enough. In particular,
we are taking the Collins parser trained on the Wall Street Journal
and applying it unchanged to spontaneous human-human dialog in an
emergency rescue task domain. We have found that there are islands of
reliability in the results from the Collins parser that can be used to
substantially improve the performance of the TRIPS parser.

1 We do not intend the “chunking” sense of shallow parsing; all our
parsers return tree structures.

The remainder of the paper is organized as follows. Section 2.1
provides background on the Monroe corpus, a set of task-oriented
dialogs that is the basis for the parser evaluations. In section 2.2
we describe the TRIPS parser and the representation it produces for
reasoning. In section 3 we describe the preliminary evaluations we
carried out by running the Collins parser over the Monroe corpus. We
then describe our experiments in combining the parsers under different
conditions, first seeing how this method can improve overall parsing
of our corpus, and then under the real-time parsing conditions
required for spoken dialog systems. We find we can get substantial
efficiency improvements on the corpus parsing, which mostly disappear
when we look at the semi-real-time case. In the latter, however, we do
see some improvement in coverage.

2  Background 

2.1  The  Monroe  Corpus 

Our data consists of transcribed dialogs between two humans engaged in
carefully designed tasks in simulated emergency management situations
in Monroe County, New York (Stent, 2001). The scenario was designed to
encourage collaborative problem solving and mixed initiative
interaction involving complex planning and coordination between the
participants, so the communication is very spontaneous and
interactive. The corpus is split into utterances, and the speech
repairs are marked and automatically removed for these tests.
Utterances that are incomplete or uninterpretable (by humans) are also
marked and eliminated from the corpus. The remaining utterances form
the set on which we have been developing and testing the grammar.
Figure 1 shows an excerpt from one of the dialogs.

U  We also have to send a road crew there as well
S  So we probably can’t actually send the ambulance over the bridge
U  You’re probably right
U  Because it’s going to take another two hours
U  So we’ll actually run out of time if we wait for that
U  So I guess we’ll need to send them
U  Actually could we send them up fifteen across two fifty two down three eighty three
U  Take that way around
S  Wait
S  The generator’s going downtown
S  Right
U  The generator is going to two fifty two
S  Oh oh I see the problem
U  So if we go up fifteen or go south on fifteen
S  And then go up three eighty three
U  Two fifty two
S  Three eighty three
U  And that’ll get us all the way over to the person with pneumonia or the person who needs the generator
U  Say at the most it takes an hour
U  It should take no more than an hour to get the generator over to that person
S  Okay
S  So we have the people taken care of

Figure 1: Excerpt from Monroe dialog

The entire Monroe corpus consists of 20 dialogs ranging from about 7
minutes up to 40 minutes in length. Our tests here focus on a subset
of five dialogs that have been used to drive the grammar development:
s2, s4, s12, s16 and s17 (henceforth dialogs 1, 2, 3, 4 and 5),
constituting 1556 parseable utterances.2

2Parseable utterances exclude utterances that are incomplete or
ungrammatical (see Tetreault et al., 2004).

2.2  The TRIPS Parser

The deep parser we used is a robust parsing system developed within
the TRIPS system over the past five years and driven by five different
domains. The grammatical formalism and parsing framework is
essentially a lexicalized version of the formalism described in
(Allen, 1995). It is a GPSG/HPSG (Pollard and Sag, 1994) inspired
unification grammar of approximately 1300 rules with a rich model of
semantic features (Dzikovska, 2004). The parser is an agenda-driven
best-first chart parser that supports experimentation with different
parsing strategies, although in practice we almost always use a
straightforward bi-directional bottom-up algorithm. As an illustration
of its flexibility, the modifications needed for this experiment
amounted to adding only one function of ten lines of code. The grammar
used for these experiments is the same TRIPS grammar used in all our
applications, and the rules have hand-tuned weights. The weights of
newly derived constituents are computed exactly as in a PCFG
algorithm, the only difference being that the weights don’t
necessarily add to 1 and so are not probabilities.3 The TRIPS parser
does not use a maximum entropy model (cf. the XLE system (Kaplan et
al., 2004)) because there is insufficient training data and it is as
yet unclear how such a model would perform at the detailed level of
semantic representation produced by the TRIPS parser (see Figure 2 and
discussion below).
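The weight computation described above can be illustrated with a minimal sketch. It assumes the usual PCFG convention that the weight of a derived constituent is the weight of the rule multiplied by the weights of its children; the function name `derived_weight` and all numeric weights are hypothetical and are not taken from the TRIPS grammar.

```python
from math import prod

def derived_weight(rule_weight, child_weights):
    """Weight of a newly derived constituent, PCFG style (assumption):
    the rule weight times the product of the children's weights."""
    return rule_weight * prod(child_weights)

# Hypothetical hand-tuned weights. Unlike PCFG probabilities, the
# weights of rules sharing a left-hand side need not sum to 1.
np_weight = derived_weight(0.98, [1.0, 0.97])       # e.g. NP -> DET N
s_weight = derived_weight(0.99, [np_weight, 0.95])  # e.g. S -> NP VP
```

Because every factor is at most 1 in this sketch, larger constituents never outrank the pieces they are built from, which is what lets a best-first agenda order them.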

The rules, lexicon, and semantic ontology are independent of any
specific domain but tailored to human-computer practical dialog. The
grammar is fairly extensive in coverage (and still growing), and has
quite good coverage of a corpus of human-human dialogs in the Monroe
domain, an emergency management domain (Swift et al., 2004). The
system is in active use in our spoken dialog understanding work in
several different domains. It operates in close to real-time for short
utterances, but degrades in performance as utterances become longer
than 8 or 9 words. As one way to control ambiguity, the grammar makes
use of selectional restrictions. Our semantic model utilizes two
related mechanisms: first, an ontology of the predicates that are used
to create the logical forms, and second, a vector of semantic features
associated with these predicates that are used for selectional
restrictions. The grammar computes a flattened and unscoped logical
form using reified events (see also (Copestake et al., 1997) for a
flat semantic representation), with many of its word senses derived
from FrameNet frames (Johnson and Fillmore, 2000) and semantic roles
(Fillmore, 1968). An example of the logical form representation
produced by the parser is shown in Figure 2, in both a dependency
graph (upper) and the actual parser output (lower).4

SA_TELL
  :content -> LF::FILL-CONTAINER
    :goal  -> THE LF::LAND-VEHICLE
    :theme -> A (SET-OF LF::FRUIT)
      :subset   -> THE (SET-OF LF::FRUIT)
      :QUANTITY -> LF::WEIGHT-UNIT POUND
        :quan -> LF::NUMBER
          :mods -> LF::QMODIFIER MIN
            :of -> LF::NUMBER
            :is -> 300

(SPEECHACT V38109 SA_TELL :CONTENT V37618)
(F V37618 (LF::FILL-CONTAINER LOAD) :GOAL V37800 :THEME V38041
   :TMA ((TENSE PAST) (PASSIVE +)))
(THE V37800 (LF::LAND-VEHICLE TRUCK))
(A V38041 (SET-OF (LF::FRUIT ORANGE)) :QUANTITY V37526 :SUBSET V37539)
(QUANTITY-TERM V37526 (LF::WEIGHT-UNIT POUND) :QUAN V37479)
(QUANTITY-TERM V37479 LF::NUMBER :MODS (V38268))
(F V38268 (LF::QMODIFIER MIN) :OF V37479 :IS V37523)
(QUANTITY-TERM V37523 LF::NUMBER :VALUE 300)
(THE V37539 (SET-OF (LF::FRUIT ORANGE)))

Figure 2: Parser logical form (together with a graphical approximation
of the semantic content) for At least three hundred pounds of the
oranges were put in the truck.

3We have a version of the grammar that uses a non-lexicalized PCFG
model, but it was not used here as it does not perform as well. Thus
we are using our best model, making it the most challenging to show
improvement.

4Term constructors appearing at the leftmost edge of terms in the
parser output are F (relation), A (indefinite entity), THE (definite
entity) and QUANTITY-TERM (numeric expressions).


3  Collins  Parser  Evaluation 

As a pilot experiment, we evaluated the performance of the Collins
parser on a single dialog of 167 sentences from the Monroe corpus,
dialog 3. We extracted context-free grammar backbones from our TRIPS
gold standard parses to score the Collins output against. The
evaluation was complicated by differences in tree formats, illustrated
in Figure 3. The two parsers use different (though closely related)
sets of syntactic categories. The TRIPS structure generally has more
levels of structure (roughly corresponding to levels in X-bar theory)
than the Penn Treebank analyses (Marcus et al., 1993), in particular
for base noun phrases.

We converted the TRIPS category labels to their nearest equivalent in
the Penn Treebank inventory before scoring the Collins parser in terms
of labeled precision and recall of constituents, the standard measures
in the statistical parsing community. Overall recall was 32%, while
precision was 64%. While we expect the Collins parser to have low
recall (it generates fewer constituents overall), the low precision
indicates that simply relabeling constituents on a one-for-one basis
is not sufficient to resolve the differences in the two formalisms.
Precision and recall broken down by constituent type are shown in
Table 1.
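As a reminder of how these measures are defined, the following is a minimal sketch of labeled precision and recall over constituents. It assumes each constituent is represented as a `(label, start, end)` triple over word positions; the function name and the toy constituents are illustrative and do not come from the Monroe data.

```python
def precision_recall(gold, produced):
    """Labeled precision and recall.

    gold, produced: sets of (label, start, end) constituents, where
    start/end are word positions (end exclusive).
    """
    matched = len(gold & produced)  # same label and same span
    precision = matched / len(produced) if produced else 0.0
    recall = matched / len(gold) if gold else 0.0
    return precision, recall

# Toy example: the gold analysis has more constituents than the
# produced one, so recall is lower than precision.
gold = {("NP", 0, 1), ("VP", 1, 5), ("NP", 2, 5), ("PP", 5, 8), ("NP", 6, 8)}
produced = {("NP", 0, 1), ("VP", 1, 8), ("NP", 2, 5)}
p, r = precision_recall(gold, produced)
```

Here two of the three produced constituents match the gold set, while the gold set contains five constituents in total.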


Figure 3: Skeleton tree output from the Collins parser (left) and the
TRIPS parser (right) for I have a bomb attack at the airport.


However,  82%  of  the  sentences  have  no  cross¬ 
ing  brackets  in  the  Collins  parse.  That  is,  while  the 
parser  may  not  generate  the  same  set  of  constituents, 
it  generates  very  few  constituents  that  straddle  the 
boundaries  of  any  constituent  in  the  TRIPS  parse. 
At  this  level,  the  parsers  agree  about  the  structure 
of  the  sentences  to  a  degree  that  is  perhaps  surpris¬ 
ing  given  the  very  different  domain  on  which  the 
Collins  parser  is  trained.  This  indicates  that  the 
low  performance  on  the  other  measures  has  more 
to  do  with  differences  in  the  annotation  style  than 
real  mistakes  by  the  Collins  parser. 
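The crossing-bracket criterion can be made concrete with a short sketch. It assumes spans are `(start, end)` pairs over word positions with an exclusive end; the function names are hypothetical. Two brackets cross when they overlap but neither contains the other.

```python
def crosses(a, b):
    """True if spans a and b overlap without either containing the other."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 < e1 < e2 or s2 < s1 < e2 < e1

def has_crossing(produced_spans, gold_spans):
    """True if any produced bracket straddles a gold constituent boundary."""
    return any(crosses(p, g) for p in produced_spans for g in gold_spans)

# (0, 3) and (2, 5) overlap on word 2 only, so they cross;
# (2, 5) nested inside (0, 5) does not count as a crossing.
example_cross = crosses((0, 3), (2, 5))
example_nested = crosses((0, 5), (2, 5))
```

A sentence counts as having no crossing brackets when `has_crossing` is false for its produced and gold span sets.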

The  high  level  of  agreement  on  unlabeled  brack¬ 
etings  led  us  to  believe  that  the  Collins  structure 
could  be  used  as  a  filter  for  constituents  generated 
by  the  TRIPS  parser.  We  tested  this  strategy  in  ex¬ 
periments  reported  in  the  following  section. 

4  Experiments 

In  all  the  experiments,  we  used  a  subset  of  five  di¬ 
alogs  (consisting  of  1326  utterances)  from  the  Mon¬ 
roe  corpus,  described  in  2.1.  Pilot  trials  were  con¬ 
ducted  on  dialog  3  (167  utterances),  and  the  exper¬ 
iments  were  run  with  the  remaining  dialogs  (1,  2,  4 
and  5). 

4.1  Method 

The first experiment evaluates whether we can extract information from
the Collins output that is reliable enough to provide significant
improvements to the TRIPS parser. In order to compare our performance
with (Frank et al., 2002), the test only uses utterances for which we
have a gold-standard. In addition, we report our experiments only on
utterances 6 words or longer (with an average of 10.3 words per
utterance), as shorter utterances pose little problem for the TRIPS
parser and thus running the Collins pre-processing step would not be
productive.

We parsed dialogs 1, 2, 4 and 5 with the Collins parser, and extracted
the phrase-level bracketing for the most reliable constituents (those
which have a precision of at least 60%) in our pilot study: NP, VP and
ADVP.5 From this information we constructed a parse skeleton for each
utterance, such as the one shown in Figure 4.
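Skeleton construction of this kind can be sketched as follows. The sketch assumes the shallow parse is available as a Penn Treebank style bracketed string with plain category labels (no function tags), which is an assumption about the input format rather than a description of the actual pipeline; the function name `skeleton` is hypothetical.

```python
import re

KEEP = {"NP", "VP", "ADVP"}  # the categories retained in the skeleton

def skeleton(bracketed, keep=KEEP):
    """Return the set of (label, start, end) spans for the kept categories.

    bracketed: a tree such as "(S (NP (PRP we)) (VP (VBP go)))".
    Positions count words from 0; end is exclusive.
    """
    tokens = re.findall(r"\(|\)|[^\s()]+", bracketed)
    spans, stack, pos, i = set(), [], 0, 0
    while i < len(tokens):
        tok = tokens[i]
        if tok == "(":
            # The token after "(" is the category label.
            stack.append((tokens[i + 1], pos))
            i += 2
        elif tok == ")":
            label, start = stack.pop()
            if label in keep and pos > start:
                spans.add((label, start, pos))
            i += 1
        else:
            pos += 1  # a word: advance the position counter
            i += 1
    return spans

tree = "(S (NP (PRP we)) (VP (VBP send) (NP (CD one) (NN ambulance))))"
spans = skeleton(tree)
```

For the toy tree above, the skeleton keeps the two NP spans and the VP span and discards the S node and the part-of-speech nodes.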

For our experiments we modified the TRIPS parser so that when a
constituent is to be added to the chart, if the constituent type and
its start and end positions are found in the skeleton then the ranking
for that constituent is boosted by a small amount. In pilot trials we
determined the optimal boost weight to be 3% (see Table 2).
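The boosting step can be sketched in a few lines. This is not the ten-line function added to the TRIPS parser; it is an illustration that assumes the boost is applied multiplicatively to the constituent's score and that parser categories have already been mapped to the skeleton's labels. The names `boosted_score` and `BOOST` are hypothetical.

```python
BOOST = 0.03  # the 3% boost weight chosen in the pilot trials

def boosted_score(score, label, start, end, skeleton, boost=BOOST):
    """Raise the score of a constituent whose label and span appear in
    the skeleton; leave all other constituents unchanged.

    Assumption: the boost is multiplicative, i.e. score * (1 + boost).
    """
    if (label, start, end) in skeleton:
        return score * (1.0 + boost)
    return score

skel = {("NP", 0, 1), ("VP", 1, 4), ("NP", 2, 4)}
in_skeleton = boosted_score(0.5, "NP", 2, 4, skel)   # boosted
not_in_skeleton = boosted_score(0.5, "NP", 1, 4, skel)  # unchanged
```

Since the skeleton only adjusts rankings and never blocks a constituent, the deep parser can still recover when the shallow bracketing is wrong.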

With a broad coverage grammar, it is possible that the parser could
run almost indefinitely on sentences that are difficult to parse. Thus
we set an upper limit on the number of constituents that can be added
to the chart before the parser quits. The parser runs until it finds a
complete analysis or hits this upper


5The  Collins  parse  time  for  the  309  utterances  of  6  words  or 
longer  was  30  seconds. 


label      gold   recall   produced   precision   crossing
ADJ           2     0.0%          0        0.0%       0.0%
ADJP         17    17.6%          7       42.9%      28.6%
ADVP        106    23.6%         35       71.4%      11.4%
CD           17     0.0%          0        0.0%       0.0%
DT           39     0.0%          0        0.0%       0.0%
FRAG          0     0.0%          2        0.0%       0.0%
INTJ          0     0.0%         19        0.0%       0.0%
N             5     0.0%          0        0.0%       0.0%
NNP           5     0.0%          0        0.0%       0.0%
NP          170    79.4%        225       60.0%       8.9%
NPSEQ         5     0.0%          0        0.0%       0.0%
NX          106     0.0%          0        0.0%       0.0%
PP            4    50.0%         37        5.4%      13.5%
PRED          6     0.0%          0        0.0%       0.0%
PRT           0     0.0%          2        0.0%       0.0%
QP           16     0.0%          1        0.0%     100.0%
RB            5     0.0%          0        0.0%       0.0%
S            75    42.7%         83       38.6%       6.0%
SBAR         18    50.0%         17       52.9%      23.5%
SBARQ         0     0.0%          1        0.0%       0.0%
SINV          0     0.0%          2        0.0%       0.0%
SPEC         61     0.0%          0        0.0%       0.0%
SQ            0     0.0%          2        0.0%       0.0%
UTT         185     0.0%          0        0.0%       0.0%
UTTWORD      15     0.0%          0        0.0%       0.0%
VB            6     0.0%          0        0.0%       0.0%
VP          235    43.8%        124       83.1%       7.3%
WHNP          0     0.0%          3        0.0%       0.0%

Table 1: Breakdown of Collins parser performance by constituent type.
Recall refers to how many of the gold-standard TRIPS constituents were
produced by Collins, precision to how many of the produced
constituents matched TRIPS, and crossing brackets to the percentage of
TRIPS constituents that were violated by any bracketing produced by
Collins.


So [NP I] [VP guess that if [NP we] [VP send [NP one ambulance] to [NP the airport]] [NP we] [VP can [VP get [NP
more people off] [ADVP quickly]]]

Figure 4: Skeleton filter for the utterance So I guess that if we send
one ambulance to the airport we can get more people off quickly.


Boost weight      1%    2%    3%    4%    5%
Speedup factor    1.1   1.3   2.4   2.0   1.2

Table 2: Pilot trials on dialog 3 to determine boost factor.


limit. In the first experiment, this upper limit is set at 10000
constituents. In addition, we performed the same experiments with
lower upper limits to explore the question of how much of the parser
time is spent on the sentences that hit the maximum chart size limit.
In the second experiment we used an upper limit of 5000, and in the
third we used an upper limit of 1500 (the standard value for use in
our real-time dialog system to avoid long delays in responding).
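The interaction between best-first search and the chart size limit can be sketched generically. The loop below is a simplified agenda-driven search, not the TRIPS implementation: `expand` and `is_complete` are hypothetical callbacks standing in for rule application and the test for a full analysis, and constituents are treated as opaque objects.

```python
import heapq

def best_first(initial, is_complete, expand, max_chart_size=1500):
    """Pop the best-scoring constituent, add it to the chart, and stop at
    the first complete analysis or when the chart size limit is reached.

    initial: iterable of (score, constituent) pairs.
    expand(c, chart): yields (score, new_constituent) pairs.
    Returns (analysis or None, number of constituents in the chart).
    """
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # so that constituents themselves are never compared.
    agenda = [(-score, i, c) for i, (score, c) in enumerate(initial)]
    heapq.heapify(agenda)
    counter = len(agenda)
    chart = []
    while agenda and len(chart) < max_chart_size:
        _, _, c = heapq.heappop(agenda)
        chart.append(c)
        if is_complete(c):
            return c, len(chart)
        for score, new in expand(c, chart):
            heapq.heappush(agenda, (-score, counter, new))
            counter += 1
    return None, len(chart)  # gave up: agenda empty or limit reached

# Toy search space: integers, where 5 plays the role of a full analysis.
def toy_expand(c, chart):
    yield 1.0 / (c + 2), c + 1

found, used = best_first([(1.0, 0)], lambda c: c == 5, toy_expand)
cut_off, used_small = best_first([(1.0, 0)], lambda c: c == 5, toy_expand,
                                 max_chart_size=3)
```

In the toy run the full search reaches the goal after six chart entries, while a limit of three stops early without an analysis, which mirrors how a lower limit trades coverage for bounded response time.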

4.2  Results 

Results  show  significant  improvements  in  the  speed 
of  parsing.  Table  3  shows  the  exact  match  sen¬ 
tence  accuracy  and  timing  results  for  parsing  with 
and  without  skeletons  with  a  maximum  chart  size 
of  10000.  The  first  row  shows  how  many  utterances 
of  6  words  or  longer  were  parsed  in  each  dialog. 
The  next  two  rows  show  exact  match  sentence  ac¬ 
curacy  results  for  parses  obtained  with  and  without 


Dialog                               1      2      4      5    Total
Utts (6+ words)                     83     78     78     70      309
Sentence accuracy w/ skeleton (%)   57.8   50     37.2   52.9   49.5
Sentence accuracy no skeleton (%)   56.6   48.7   35.9   52.9   48.5
Time w/ skeleton (s)                46     85    127     45      303
Time no skeleton (s)                90    190    321     60      661
Speedup Factor                       1.9    2.2    2.5    1.3    2.0

Table 3: Sentence accuracy and timing results with maximum chart size
10000 for utterances of 6 or more words.


skeletons.  The  next  two  rows  show  the  total  time 
(in  seconds)  to  parse  the  dialogs  with  and  without 
the  skeletons.  The  last  row  shows  the  speed  up  fac¬ 
tor  (computed  as  time-without-skeletons/time-with- 
skeletons).6 
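The speed-up computation can be reproduced from the timing rows of Table 3 with a short sketch. The times below are the whole-second values printed in the table, so the ratios computed from them need not match the reported factors digit for digit; the variable and function names are illustrative.

```python
# Total parse times in seconds, copied from Table 3 (chart size 10000).
time_with_skeleton = {"1": 46, "2": 85, "4": 127, "5": 45}
time_no_skeleton = {"1": 90, "2": 190, "4": 321, "5": 60}

def speedup(time_without, time_with):
    """Speed-up factor: time without skeletons / time with skeletons."""
    return time_without / time_with

per_dialog = {
    d: speedup(time_no_skeleton[d], time_with_skeleton[d])
    for d in time_with_skeleton
}
# Ratio of the summed times over all four dialogs.
pooled = speedup(sum(time_no_skeleton.values()),
                 sum(time_with_skeleton.values()))
# Unweighted mean of the per-dialog factors, an alternative summary.
mean_of_factors = sum(per_dialog.values()) / len(per_dialog)
```

The two summaries differ because the pooled ratio weights each dialog by its parse time, whereas the mean treats the four dialogs equally.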

We see substantial speed-ups in the parser using this technique. The
parser using skeletons completed the parses in less than half of the
time of the original parser. Looking at individual utterances, 70%
were parsed more quickly with the skeletons, while 25% were slower.
Overall, our simple approach appears to provide a substantial payoff
in speed along with a small improvement in accuracy.

Note  that  we  use  a  strict  criterion  for  accuracy, 
so  both  the  correct  logical  form  as  well  as  the  cor¬ 
rect  syntactic  structure  must  be  computed  by  the 
parser  for  an  analysis  to  be  considered  correct  in 
our  evaluation.  A  correct  logical  form  requires  cor¬ 
rect  word  sense  disambiguation,  constituent  depen¬ 
dencies,  and  semantic  role  assignment  (see  section 
2.2).  For  example,  in  some  cases  the  parser  pro¬ 
duces  a  structurally  correct  parse,  but  selects  an  in¬ 
appropriate  word  sense,  in  which  case  the  analysis 
is  considered  incorrect.  One  such  case  is  the  utter¬ 
ance  You  know  where  the  little  loop  is,  in  which  the 
where  is  assigned  the  sense  TO-LOC  (which  should 
only  be  used  for  trajectories,  as  in  Where  did  he  go), 
when  in  this  utterance  the  correct  sense  for  where  is 
SPATIAL-LOC. 

To  explore  the  question  of  how  much  of  the  speed 
increase  is  the  result  of  time  spent  on  difficult  sen¬ 
tences  that  cause  the  parser  to  reach  the  maximum 
chart  size  limit,  we  performed  the  same  experiment 
with  a  smaller  maximum  chart  size  of  5000,  shown 
in  Table  4.  As  expected  the  speed-up  gain  declined 
to  1.8,  still  quite  a  respectable  gain,  and  again  there 


6These experiments were run with CMU Common LISP 18e and a Linux
2.4.20 kernel on a 2 GHz Xeon dual processor with 1.0 GB total memory.


Dialog                               1      2      4      5    Total
Utts (6+ words)                     83     78     78     70      309
Sentence accuracy w/ skeleton (%)   57.8   50     37.2   52.9   49.5
Sentence accuracy no skeleton (%)   55.4   48.7   35.9   52.9   48.2
Time w/ skeleton (s)                46     82    126     45      299
Time no skeleton (s)                90    148    286     59      583
Speedup Factor                       1.9    1.8    2.3    1.3    1.8

Table 4: Sentence accuracy and timing results with maximum chart size
5000 for utterances of 6 or more words.


Dialog                               1      2      4      5    Total
Utts (6+ words)                     83     78     78     70      309
Sentence accuracy w/ skeleton (%)   57.8   48.7   37.2   52.9   49.2
Sentence accuracy no skeleton (%)   55.4   47.4   35.9   52.9   47.9
Time w/ skeleton (s)                47     76    109     45      277
Time no skeleton (s)                74     92    150     59      375
Speedup Factor                       1.6    1.2    1.4    1.3    1.4

Table 5: Sentence accuracy and timing results with maximum chart size
1500 for utterances of 6 or more words.


is  no  loss  of  accuracy. 

As  we  drop  the  chart  size  to  1500,  the  speed-up 
drops  to  just  1.4,  as  shown  in  Table  5.  However, 
we  have  improvements  in  accuracy  using  skeletons 
when  we  parse  with  low  upper  limits.  In  certain 
cases  the  skeleton  guides  the  parser  to  the  correct 
parse  more  quickly,  so  it  can  be  found  even  when 
the  maximum  chart  size  is  reduced.  For  example, 
for  the  utterance  And  meanwhile  we  send  two  am¬ 
bulances  from  the  Strong  Hospital  to  take  the  six 
wounded  people  from  the  airport  (from  dialog  1), 
a  correct  full  sentence  analysis  is  found  with  the 
larger  maximum  chart  sizes  (5000  or  more),  but 
with  a  maximum  chart  size  of  1500  the  correct  anal¬ 
ysis  for  this  utterance  is  found  only  with  the  help  of 
the  skeleton. 

Our  best  results  are  similar  to  those  reported  in 
(Frank  et  al.,  2002),  who  show  a  speed-up  factor 
of  2.26,  although  they  use  a  much  larger  maximum 
chart  size  (70,000).  Because  of  the  differences  in 
grammars  and  parsers,  it  is  not  clear  how  to  fairly 
compare  the  chart  sizes. 


5  Conclusion 

With  minimal  modifications  to  our  deep  parser,  we 
have  been  able  to  achieve  a  substantial  increase  in 
parsing  speed  with  this  technique  along  with  a  small 
increase  in  accuracy.  The  experiments  reported  here 
investigated  this  technique  using  off-line  methods. 
Given our promising results, we are currently working to integrate an
on-line shallow parsing filter into our collaborative dialog
assistant.